Rendered at 05:38 AM, Mar. 07, 2020
Meghan Hoyer (mhoyer@ap.org)
https://tinyurl.com/nicar2020ggplot
Being able to visualize your data is one of the most important and helpful things about R.
But learning the language of graphics and building plots is a bit complicated. You’ll notice from tutorials and online answers that not everyone uses the same syntax for ggplot – some people collapse functions, others call things explicitly while still others minimize typing by leaving out everything but the most necessary bits.
This tutorial unpacks each step separately, in the hopes that it will be slightly easier to follow the different components of graphic-building in ggplot, and to help you debug issues as you go. As you get better at plotting, there are things you can do to minimize or refactor your code, but the techniques you’ll see today are meant for you to understand each individual step of the process.
It will be MUCH easier for you to build plots if you understand the language of ggplot.
There are a number of terms and concepts you will need to understand for this:
Data = you have to refer to the dataframe where you’re pulling the data from
Aesthetic Mapping = you have to link specific variables in your data to your plot. For instance: what’s the basis of the x-axis? What’s the y-axis? Are you pulling other data in to drive the color of the chart?
Geoms = what kind of chart are you building? ggplot allows for roughly 40 different types of geoms, and additional extensions offer more – everything from bar charts to hexbins to boxplots and maps.
Scales = your data is either continuous (think numbers) or discrete (think categorization into groups) This changed as of Thursday! Now your data can be binned as well! Scale functions help control labels and colors and how values are expressed.
Coordinates = your x- and y-axes can be flipped, or can become a radial chart, or can start at non-zero. Coordinates help you control this.
Themes = what’s the general look of your plot? ggplot provides a number of themes but you can also design your own or use others.
Let’s build a basic scatterplot, step-by-step. This will show you how the pieces come together.
We’ll be using the same data that Ryan used in the first class – census data on the number of grandparents who live with and/or are responsible for their grandchildren by county.
# Load files, environment variables, libraries, etc. here
library(tidyverse)
library(scales)
library(sf)
library(leaflet)
library(tigris)
options(scipen = 999)
census <- read_csv("data/source/grandparentsR1.csv")
# add variables to this
census <- census %>%
mutate(pct_living = LivingWGrandChildren/PopOver30,
pct_resp = Resp4GrandChildren/LivingWGrandChildren,
pct_old = Resp4GrandChildren_Over60/Resp4GrandChildren,
geoid = as.character(id2))
First, we need to pull in data and map aesthetics. This will allow you to see what the aesthetics function (aes) does. You can see it creates an empty plot area with an x-axis based on the total population and a y-axis with the number of people who live with their grandchildren, based off the values in our dataset.
ggplot(census) +
aes(x = PopOver30, y = LivingWGrandChildren)
So where are our points? Well, we need to tell ggplot what kind of graph we’d like! That’s where geom_ comes in. For this, let’s create a geom_point scatterplot, with one point per county.
ggplot(census) +
aes(x = PopOver30, y = LivingWGrandChildren) +
geom_point()
Much better! So we’ve got our basic chart. Now let’s add some additional aesthetics. We can play with color or with size, for instance. In this case, I’m most interested in what share of the people living with their grandchildren are responsible for them.
ggplot(census) +
aes(x = PopOver30, y = LivingWGrandChildren, color = pct_resp) +
geom_point()
We can also add a frame of reference. Let’s plot a trendline on top of this scatter so we can see the general pattern here and better identify counties that are outliers in having more or fewer people living with their grandchildren.
ggplot(census) +
aes(x = PopOver30, y = LivingWGrandChildren, color = pct_resp) +
geom_point() +
geom_smooth()
I still don’t love that default smoothing, though. Let’s plot a more linear trendline on this.
ggplot(census) +
aes(x = PopOver30, y = LivingWGrandChildren, color = pct_resp) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
Much better! We can really see some places now that stand out.
This is still an ugly, hard-to-read chart, though. Let’s use scales and themes to improve the look and readability of this. To do that, we’ll save the basic plot that we’ve made out and then add labels and themes on top of it, rather than writing one single block of code.
# Save out basic plot
gg <- ggplot(census) +
aes(x = PopOver30, y = LivingWGrandChildren, color = pct_resp * 100) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
# Add labels and titles
gglab <- gg +
scale_x_continuous(labels = comma) +
scale_y_continuous(labels = comma) +
labs(x = "Population Over 30",
y = "Number Living with Grandchildren",
title = "Grandparents Residing with their Grandchildren",
caption = "Source: American Community Survey",
color = "Percent Responsible")
plot(gglab)
Almost there! Let’s clean it up a bit more by adding a theme. At the AP, we have a specific theme that applies AP’s color schemes and fonts to our charts. But for this, we’ll just use one of the built-in ggplot themes.
gglab + theme_minimal()
Now that we’ve done one type of geom_point, let’s try another very different type of chart. First we’ll need to clean up the data to take away the “Louisiana” from Geographies, as it’s redundant.
Then we’ll filter the data down to the top 20 counties based on the share of grandparents living with their grandchildren who are also responsible for them.
We’ll use forcats’ fct_reorder to tell the plot that the y axis (counties), needs to be ordered by the pct_resp column.
Finally, we’ll add a lot of formatting and styling, from changing the pct_resp numbers to percentages to putting on labels and then removing some of the grid and tick marks.
census_clean <- census %>%
separate(Geography, into = c("County", "State"), sep = ",")
p <- ggplot(census_clean %>%
arrange(-pct_resp) %>%
head(20)) +
aes(x = pct_resp, y = fct_reorder(County, pct_resp)) +
geom_point()
plab <- p +
scale_x_continuous(labels = percent) +
labs(x = "Share of grandparents responsible",
y = "",
title = "Largest share of grandparents responsible for kids",
caption = "Source: American Community Survey")
plab +
theme_minimal() +
theme(panel.grid.minor = element_blank()) +
theme(panel.grid.major.x = element_blank())
We’re going to build multiple plots. This works great if you have a graph you want to build for every state in the U.S., or every county. In this example, we’re going to create a basic histogram showing the distribution of counties over different population types.
To do this, first we need to make our data tidy. You’ll see this condenses the data into one type of population per line per county.
census_tidy <- census %>%
select(id2, Geography, TotalPop, pct_living, pct_resp, pct_old) %>%
pivot_longer(cols = (3:6), names_to = "type", values_to = "pop_in_category")
Then, let’s do a basic histogram that shows how each type is distributed.
Let’s create some labels for the facets.
labels <- c(pct_living = "Living with Grandchildren",
pct_resp = "Responsible for Grandchildren",
pct_old = "Older Responsible Grandparents")
There are a few things going on here: first, we’re filtering out “TotalPop” from the type, because it’s a number and not a percentage. Then, we build a histogram and bin the data. Finally, we add facet_grid and separate the data out over types. We set scales = free because the first - the share of people over 30 who are grandparents responsible for their grandchildren - is so much smaller than the other two metrics.
gghist <- ggplot(census_tidy %>% filter(type != "TotalPop")) +
aes(x = pop_in_category) +
geom_histogram(bins = 10) +
facet_grid(~type,
scales = "free",
labeller=labeller(type = labels))
Finally, we need to add labels and proper scales.
gghist + scale_x_continuous(label = percent) +
labs(x = "Share of Population in Category",
y = "Number of counties",
title = "Distribution of counties")
There are a million types of maps you can build in R but we’re going to create a chloropleth map built in Leaflet.
Let’s start by reading in our shapefile, just as we would a csv, but using st_read through the sf package. Once we do that, we’ll filter it for only Louisiana counties, and then do a little cleanup.
Finally, we’ll join the shapefile to our census data with tigris’ geo_join function.
county_shape <- st_read("data/source/tl_2019_us_county.shp")
## Reading layer `tl_2019_us_county' from data source `/Users/mhoyer/code/nicar-2020-r-plotting-lesson/data/source/tl_2019_us_county.shp' using driver `ESRI Shapefile'
## Simple feature collection with 3233 features and 17 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -179.2311 ymin: -14.60181 xmax: 179.8597 ymax: 71.43979
## epsg (SRID): 4269
## proj4string: +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs
county_map <- county_shape %>%
filter(STATEFP == "22") %>%
mutate(geoid = as.character(GEOID))
#join the data
counties <- geo_join(county_map, census_clean, "geoid", "geoid") %>%
mutate(pct_resp = round(pct_resp * 100, 1))
With our data in hand, let’s turn to some map basics: setting a color palette and assigning colors to a variable. We’ll also build the popup.
pal <- colorNumeric("Blues", domain=(counties$pct_resp))
# Setting up the popup text
popup_text <- paste0("<b>", counties$County, "</b>", "</br/>Of the grandparents living with their grandchildren, the share responsible for raising them is \n", as.character(counties$pct_resp), "%")
Finally, we actually make our map.
leaflet() %>%
addProviderTiles("CartoDB.Positron") %>%
setView(-92.329102, 30.391830, zoom = 7) %>%
addPolygons(data = counties,
fillColor = ~pal(pct_resp),
fillOpacity = 1,
weight = 0.9,
smoothFactor = 0.2,
stroke=TRUE,
color="white",
popup = ~popup_text) %>%
addLegend(pal = pal,
values = counties$pct_resp,
position = "bottomleft",
title = "Share of grandparents </br/>
responsible")
Finally, we can save out our map as a basic html page that you can share with someone else. You’ll have to install and load the htmlwidgets library to do this, but here’s the code. First you run the same code as above, but save it as a leaflet element.
map <- leaflet() %>%
addProviderTiles("CartoDB.Positron") %>%
setView(-92.329102, 30.391830, zoom = 7) %>%
addPolygons(data = counties,
fillColor = ~pal(pct_resp),
fillOpacity = 1,
weight = 0.9,
smoothFactor = 0.2,
stroke=TRUE,
color="white",
popup = ~popup_text) %>%
addLegend(pal = pal,
values = counties$pct_resp,
position = "bottomleft",
title = "Share of grandparents </br/>
responsible")
Then, you use this command to save out the map:
saveWidget(map, file=“data/grandparent_map.html”)